
Numpy parquet faster #21

Closed
rom1504 wants to merge 4 commits from the numpy_parquet_faster branch

Conversation

@rom1504 (Owner) commented on Apr 19, 2022

Faster, but didn't solve the memleak.

computing safety predictions on top of clip embeddings
@rom1504 (Owner, Author) commented on Apr 20, 2022

This did solve the memleak, but the current mutex + dict implementation is complex and not very reliable.
Instead, handle this in the loader function of the iteration: prepare the file -> table mapping in advance, and clean up tables we no longer need using reference counting (a sketch follows below).
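A minimal sketch of one way to read that suggestion, assuming a hypothetical `pieces` list of `(filename, start_row, num_rows)` entries rather than the repo's actual data structure: the reference counts are prepared up front from the piece list, each file is read at most once, and a table is dropped as soon as the last piece that needs it has been yielded, so no mutex-guarded shared cache is required.

```python
from collections import Counter

import pyarrow.parquet as pq


def iterate_pieces(pieces):
    """Yield table slices for precomputed (filename, start_row, num_rows) pieces."""
    # Prepare the file -> table bookkeeping in advance: count how many pieces
    # reference each file so its table can be freed as soon as it is unused.
    refcounts = Counter(filename for filename, _, _ in pieces)
    tables = {}

    for filename, start_row, num_rows in pieces:
        if filename not in tables:
            tables[filename] = pq.read_table(filename)  # read each file once
        yield tables[filename].slice(start_row, num_rows)

        # Reference counting: release the table when no remaining piece needs it.
        refcounts[filename] -= 1
        if refcounts[filename] == 0:
            del tables[filename]
```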

@rom1504 (Owner, Author) commented on Apr 20, 2022

Alternatively, remove the parallelism completely from the parquet reading and instead use a simpler loader that relies on the precomputed pieces to know which slices to read.
It's worth benchmarking a few solutions for the parquet reading alone; a sketch of the sequential approach follows.
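A sketch of that simpler, single-threaded alternative, again with an assumed piece layout of `(filename, row_group_index)` pairs rather than whatever the repo actually precomputes:

```python
import pyarrow.parquet as pq


def sequential_loader(pieces, columns=None):
    """Read precomputed (filename, row_group_index) pieces one at a time, no threads."""
    for filename, row_group in pieces:
        parquet_file = pq.ParquetFile(filename)
        # use_threads=False keeps the read strictly sequential, which makes the
        # benchmark comparison against the threaded readers clean.
        yield parquet_file.read_row_group(row_group, columns=columns, use_threads=False)
```

Benchmarking this against the mutex + dict reader on the same parquet files would show whether the extra complexity of the threaded version actually pays off.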

@Veldrovive (Contributor) commented:
It might also be valuable to take into account down-the-line parallelization such as torch dataset workers. Maybe sequential parquet reading would be fine in that case (sketched below).
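A hedged sketch of that idea (the class name and the `(filename, row_group_index)` piece layout are assumptions for illustration, not the repo's API): each torch DataLoader worker gets its own shard of the precomputed pieces and reads them with plain sequential pyarrow calls, so the workers supply the parallelism.

```python
import pyarrow.parquet as pq
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class ParquetPieces(IterableDataset):
    """Iterate over precomputed (filename, row_group_index) pieces, one shard per worker."""

    def __init__(self, pieces, columns=None):
        self.pieces = pieces
        self.columns = columns

    def __iter__(self):
        worker = get_worker_info()
        # Shard the pieces across DataLoader workers; each worker reads sequentially.
        pieces = self.pieces if worker is None else self.pieces[worker.id :: worker.num_workers]
        for filename, row_group in pieces:
            table = pq.ParquetFile(filename).read_row_group(
                row_group, columns=self.columns, use_threads=False
            )
            yield table  # downstream code would turn the columns into tensors


# Usage: DataLoader(ParquetPieces(pieces), batch_size=None, num_workers=4)
```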

@rom1504 (Owner, Author) commented on May 15, 2022

Done in another PR.

@rom1504 rom1504 closed this May 15, 2022
@rom1504 rom1504 deleted the numpy_parquet_faster branch May 15, 2022 23:55